-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix join on MultiIndex for mixed Datetimelike and string levels #32739
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
your proposed patch is very very specific to DTI/TDI conversions. Please try to narrow down exactly what is going wrong here, this is introducing a massive level of complexity.
@@ -3408,7 +3409,16 @@ def join(self, other, how="left", level=None, return_indexers=False, sort=False) | |||
|
|||
# have the same levels/names so a simple join | |||
if self.names == other.names: | |||
pass | |||
# For datetimelike levels, we maybe convert the corresponding level |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woa, we don't want to do anything like this.
The bug is very specific to DTI/TDI conversions. What I did here is reproduce what happen when joining a DatetimeLike index with another index, see code here pandas/pandas/core/indexes/datetimelike.py Line 833 in 23f8763
The reason this bug happen it that join between DatetimeLike Index and Object Index works, ie, the returned join_index has casted the object to the corresponding datetimelike, but then ,when calling get_indexer, pandas/pandas/core/indexes/base.py Line 3505 in a59fcc9
In brief, in Index.join, for multiindex, we join datetimelike and object together with maybe_casting of the object, but this doesn't happen at the get_indexer level.
That is what happen actually for joining DatetimeLikeIndex with ObjectIndex I am open to suggestion if one could tell me what we expect here. |
@nrebena if indices type are not the same then they must be joined as object |
@jreback This is actually not the case for single index, see this example, where the indice type differs, but the resulting type is DatetimeIndex. import pandas as pd
import datetime
i1 = pd.Index(['2019-05-31'])
i2 = pd.Index([datetime.datetime(2019, 5, 31)])
print(i1.join(i2, return_indexers=True))
# (DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None), None, None) Is it an acceptable solution to fix Index.join to return an object index in this case ? |
yes see what fixing this breaks (likely would need to depreciate this) |
Update on root cause and advice needed.
Are this results what should be expected for each one? For the one with unexpected result, what should be the right one? |
all of the above should be deprecated. We shouldn't every compare strings and datetimes. We may currently do this, but we need to deprecate it. |
@nrebena if you would like to update this to the above behavior would be fine; we should not be joining datetimelike things with non-datetimelike things at all. |
Sure thing. I'll try to get to it. |
I decided to split this, and started another PR here #33531, and will continue this one regarding the behaviour of pd.Categorical. |
closing this until / unless #33531 is resolved |
Problem
From #26558, a minimal example is:
Here we can see that one of the indexers for multiindex is wrong.
Proposed fix 🩹
The difference of handling for single index is explain by the join method for DatetimeTimedeltaMixin.
I did some refactoring in pandas/core/indexes/datetimelike.py to be able to reuse the same datetime handling method.
Then in the Index.join method, where MultiIndex are handled, I apply this method to each level of the MultiIndex that is a DatetimeTimedeltaMixin.
PR checkboxes ☑️
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff